library(dplyr)
library(ggplot2)
library(inlabru)
library(INLA)
library(terra)
library(sf)
library(scico)
library(magrittr)
library(patchwork)
library(tidyterra)
library(tidyr)
# We want to obtain CPO data from the estimations
bru_options_set(control.compute = list(dic = TRUE,
waic = TRUE,
mlik = TRUE,
cpo = TRUE))
Practical - Dealing with excess zeros
In this practical we are going to work with data with excess zeros. We will:
- Fit a zero-inflated model
- Fit a hurdle model
- Fit a hurdle model using two likelihoods and a shared component
1 Data Preparation
The following example uses the gorillas dataset available in the inlabru library.
The data give the locations of gorilla nests in a study area:
gorillas_sf <- inlabru::gorillas_sf
nests <- gorillas_sf$nests
boundary <- gorillas_sf$boundary
ggplot() + geom_sf(data = nests) +
geom_sf(data = boundary, alpha = 0)
The dataset also contains covariates in the form of raster data. We consider two of them here:
gcov <- gorillas_sf_gcov()
elev_cov <- gcov$elevation
dist_cov <- gcov$waterdist
Note: the covariates have been expanded to cover all the nodes in the mesh.
To obtain the count data, we rasterize the species counts to match the spatial resolution of the covariates available. Then we aggregate the pixels to a rougher resolution (5x5 pixels in the original covariate raster dimensions). Finally, we mask regions outside the study area.
In addition we compute the area of each grid cell.
# Rasterize data
counts_rstr <-
terra::rasterize(vect(nests), gcov, fun = sum, background = 0) %>%
terra::aggregate(fact = 5, fun = sum) %>%
mask(vect(sf::st_geometry(boundary)))
plot(counts_rstr)
# compute cell area
counts_rstr <- counts_rstr %>%
cellSize(unit = "km") %>%
c(counts_rstr)
To create our dataset of counts, we also extract the coordinates of the center point of each raster pixel. In addition, we create a column with presences and one with the pixel area.
counts_df <- crds(counts_rstr, df = TRUE, na.rm = TRUE) %>%
bind_cols(values(counts_rstr, mat = TRUE, na.rm = TRUE)) %>%
rename(count = sum) %>%
mutate(present = (count > 0) * 1L) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(nests))
We then aggregate the covariates to the same resolution as the nest counts and scale them.
elev_cov1 <- elev_cov %>%
terra::aggregate(fact = 5, fun = mean) %>% scale()
dist_cov1 <- dist_cov %>%
terra::aggregate(fact = 5, fun = mean) %>% scale()
2 Mesh building
We now define the mesh and the SPDE model object.
mesh <- fm_mesh_2d(
loc = st_as_sfc(counts_df),
max.edge = c(0.5, 1),
crs = st_crs(counts_df)
)
matern <- inla.spde2.pcmatern(mesh,
prior.sigma = c(1, 0.01),
prior.range = c(5, 0.01)
)
In our dataset, the number of zeros is quite substantial, and a standard Poisson model may struggle to account for them adequately. To address this, we should select a model capable of handling an "inflated" number of zeros, exceeding what a standard Poisson model would imply. For this purpose, we opt for a zero-inflated Poisson model, commonly abbreviated as ZIP.
3 Zero-Inflated model (Type1)
We fit now a Zero-Inflated model to our data.
The Type 1 Zero-inflated Poisson model is defined as follows:
\[ \text{Prob}(y\vert\dots)=\pi\times 1_{y=0}+(1-\pi)\times \text{Poisson}(y) \]
Here, \(\pi=\text{logit}^{-1}(\theta)\)
The expected value and variance for the counts are calculated as:
\[ \begin{gathered} E(count)=(1-\pi)\lambda \\ Var(count)= (1-\pi)(\lambda+\pi \lambda^2) \end{gathered} \tag{1}\]
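As a quick sanity check on Equation 1, we can simulate zero-inflated Poisson draws and compare the empirical mean and variance with the formulas. This is a standalone sketch with made-up parameter values, unrelated to the gorilla data:

```r
# Zero-inflated Poisson draws: a Bernoulli(1 - pi) "not an excess zero"
# indicator multiplied by a Poisson(lambda) count
set.seed(1)
pi_ <- 0.3      # excess-zero probability (made up)
lambda <- 2.5   # Poisson mean (made up)
n <- 1e6
y <- rbinom(n, 1, 1 - pi_) * rpois(n, lambda)
# Compare simulation with Equation 1
c(mean(y), (1 - pi_) * lambda)                    # both close to 1.75
c(var(y), (1 - pi_) * (lambda + pi_ * lambda^2))  # both close to 3.0625
```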
This model has two parameters:
- The probability of excess zeros \(\pi\). This is a hyperparameter and is therefore constant across observations
- The mean of the Poisson distribution \(\lambda\). This is linked to the linear predictor as: \[ \eta = \log(\lambda) = \log(E) + \beta_0 + \beta_1\text{Elevation} + \beta_2\text{Distance} + u \] where \(\log(E)\) is an offset, with \(E\) the pixel area, that accounts for the size of the cell.
Once the model is fitted, we can look at the results.
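The code later in this practical refers to a fitted object fit_zip. A fit along these lines could be set up as below. This is a sketch under assumptions, not the verbatim code: the component names (Intercept, elevation, distance, space) and the use of the cell area as exposure E are chosen to match the prediction code in Section 7.

```r
# Components: intercept, linear covariate effects, SPDE spatial field
cmp <- ~ Intercept(1) +
  elevation(elev_cov1, model = "linear") +  # scaled elevation raster
  distance(dist_cov1, model = "linear") +   # scaled distance-to-water raster
  space(geometry, model = matern)           # Matern SPDE field

fit_zip <- bru(
  cmp,
  like(
    count ~ Intercept + elevation + distance + space,
    family = "zeroinflatedpoisson1",  # Type 1 ZIP likelihood in INLA
    data = counts_df,
    E = counts_df$area                # pixel area as exposure
  )
)
summary(fit_zip)
```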
4 Hurdle model (Type0)
We now fit a hurdle model to the same data.
The zeroinflatedpoisson0 model is defined by the following observation probability model:
\[ \text{Prob}(y\vert\dots)=\pi\times 1_{y=0}+(1-\pi)\times \text{Poisson}(y\vert y>0) \]
where \(\pi\) is the probability of zero.
The expectation and variance of the counts are as follows:
\[ \begin{aligned} E(\text{count})&=\frac{(1-\pi)\lambda}{1-\exp(-\lambda)} \\ Var(\text{count})&= E(\text{count}) \left(1+\lambda-E(\text{count})\right) \end{aligned} \tag{2}\]
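A fit for this hurdle model could look as follows, restating the same assumed component names as for the ZIP sketch above for completeness; only the likelihood family changes.

```r
# Components: intercept, linear covariate effects, SPDE spatial field
cmp <- ~ Intercept(1) +
  elevation(elev_cov1, model = "linear") +
  distance(dist_cov1, model = "linear") +
  space(geometry, model = matern)

fit_zap <- bru(
  cmp,
  like(
    count ~ Intercept + elevation + distance + space,
    family = "zeroinflatedpoisson0",  # Type 0 (hurdle) Poisson in INLA
    data = counts_df,
    E = counts_df$area
  )
)
```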
5 Hurdle model using two likelihoods
Here the model is the same as in Section 4, but this time we also want to model \(\pi\) using covariates and random effects. Therefore we define a second linear predictor \[ \eta^2 =\beta_0^2 + \beta_1^2\text{Elevation} + \beta_2^2\text{Distance} + u^2 \] Note that here we have defined the two linear predictors to use the same covariates, but this is not necessary; they can be completely independent.
To fit this model we have to define two likelihoods:
- One will account for the presence process and has a binomial model
- One will account for the counts and has a truncated Poisson model
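One possible way to set this up in inlabru is sketched below; it is not necessarily the exact code used for the results later. The component names (Intercept_presence, elev_presence, ..., space_count) are assumptions chosen to mirror the prediction code in Section 7; the presence part uses a binomial likelihood on the present indicator, and the count part reuses the zeroinflatedpoisson0 family on the positive counts as a stand-in for a truncated Poisson (its zero-probability parameter is then estimated near zero on zero-free data).

```r
# Separate components for the presence and count predictors
cmp2 <- ~ Intercept_presence(1) + Intercept_count(1) +
  elev_presence(elev_cov1, model = "linear") +
  dist_presence(dist_cov1, model = "linear") +
  space_presence(geometry, model = matern) +
  elev_count(elev_cov1, model = "linear") +
  dist_count(dist_cov1, model = "linear") +
  space_count(geometry, model = matern)

# Presence process: Bernoulli observations of the presence indicator
lik_presence <- like(
  present ~ Intercept_presence + elev_presence + dist_presence + space_presence,
  family = "binomial", data = counts_df
)
# Count process: positive counts only
lik_count <- like(
  count ~ Intercept_count + elev_count + dist_count + space_count,
  family = "zeroinflatedpoisson0",
  data = counts_df[counts_df$count > 0, ],
  E = counts_df$area[counts_df$count > 0]
)
fit_zap2 <- bru(cmp2, lik_presence, lik_count)
```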
7 Comparing models
We have fitted four different models. Now we want to compare them and see how they fit the data.
7.1 Comparing model predictions
We first want to compare the estimated surfaces of expected counts. To do this we produce the estimated expected counts for all four models, similar to what we did in Section 3 and Section 4, and plot them together:
pred_zip <- predict(
fit_zip,
counts_df,
~ {
pi <- zero_probability_parameter_for_zero_inflated_poisson_1
lambda <- area * exp( distance + elevation + space + Intercept)
expect <- (1-pi) * lambda
variance <- (1-pi) * (lambda + pi * lambda^2)
list(
expect = expect,
variance = variance
)
},n.samples = 2500)
pred_zap <- predict( fit_zap, counts_df,
~ {
pi <- zero_probability_parameter_for_zero_inflated_poisson_0
lambda <- area * exp( distance + elevation + space + Intercept)
expect <- (1 - pi) * lambda / (1 - exp(-lambda))
list(
expect = expect)
},n.samples = 2500)
inv.logit <- function(x) exp(x) / (1 + exp(x))
pred_zap2 <- predict( fit_zap2, counts_df,
~ {
pi <- inv.logit(Intercept_presence + elev_presence + dist_presence + space_presence)
lambda <- area * exp( dist_count + elev_count + space_count + Intercept_count)
expect <- ((1-exp(-lambda))^(-1) * pi * lambda)
list(
expect = expect)
},n.samples = 2500)
pred_zap3 <- predict( fit_zap3, counts_df,
~ {
pi <- inv.logit(Intercept_presence + elev_presence + dist_presence + space_copy)
lambda <- area * exp( dist_count + elev_count + space + Intercept_count)
expect <- ((1-exp(-lambda))^(-1) * pi * lambda)
list(
expect = expect)
},n.samples = 2500)
data.frame(x = st_coordinates(counts_df)[,1],
y = st_coordinates(counts_df)[,2],
zip = pred_zip$expect$mean,
hurdle = pred_zap$expect$mean,
hurdle2 = pred_zap2$expect$mean,
hurdle3 = pred_zap3$expect$mean) %>%
pivot_longer(-c(x,y)) %>%
ggplot() + geom_tile(aes(x,y, fill = value)) + facet_wrap(.~name) +
theme_map + scale_fill_scico(direction = -1)
7.2 Model Comparison
We have set control.compute = list(cpo = TRUE, waic = TRUE, dic = TRUE, mlik = TRUE) before running the models, so we can compare the scores for the four models:
scores <- data.frame(
names = c("zip", "hurdle", "hurdle2", "hurdle3"),
dic = c(fit_zip$dic$dic, fit_zap$dic$dic, fit_zap2$dic$dic, fit_zap3$dic$dic),
waic = c(fit_zip$waic$waic, fit_zap$waic$waic, fit_zap2$waic$waic, fit_zap3$waic$waic),
mlik = c(fit_zip$mlik[1], fit_zap$mlik[1], fit_zap2$mlik[1], fit_zap3$mlik[1]))
scores
#> names dic waic mlik
#> 1 zip 1214.132 1223.710 -686.9500
#> 2 hurdle 1886.065 1909.721 -994.3807
#> 3 hurdle2 1268.307 1285.508 -734.3704
#> 4 hurdle3 1699.393 2645.488 -858.2067
From the table above we see that, when looking at a balance between complexity and fit, the zero-inflated model (zip) seems to be the best one.
The variance in count predictions can be obtained from the posterior predictions of expectations and variances that were previously computed for each grid box. Let’s denote the count expectation in each grid box as \(m_i\), and the count variance as \(s_i^2\), both conditioned on the model predictor \(\eta_i\). Then, the posterior predictive variance of the count \(X_i\) is given by:
\[ \begin{aligned} V(X_i) &= E(V(X_i|\eta_i)) + V(E(X_i|\eta_i)) \\ &= E(s_i^2) + V(m_i) . \end{aligned} \]
This equation provides the posterior predictive variance of the count \(X_i\) based on the expectations and variances of the model predictions \(m_i\) and \(s_i^2\) for each grid box, conditioned on the model predictor \(\eta_i\).
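The decomposition can be illustrated with simulated posterior draws. This is a standalone toy example, not the gorilla data: for a Poisson count, the conditional mean and variance are both \(\exp(\eta)\), and the law of total variance recovers the marginal variance of the predictive draws.

```r
set.seed(1)
n <- 1e5
eta <- rnorm(n, mean = 0, sd = 0.5)  # posterior draws of a predictor
m  <- exp(eta)                       # E(X | eta) for a Poisson count
s2 <- exp(eta)                       # V(X | eta) equals the mean for Poisson
v_total <- mean(s2) + var(m)         # E(s2) + V(m)
x <- rpois(n, m)                     # draws from the posterior predictive
c(v_total, var(x))                   # the two agree up to Monte Carlo error
```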
The Probability Integral Transform (PIT) [@marshall2003ApproximateCrossvalidatoryPredictive;@gelman1996PosteriorPredictiveAssessment;@held2010PosteriorCrossvalidatoryPredictive] is calculated as the value of the model's predictive cumulative distribution function (CDF) at each observed data point. Mathematically, for each observation \(y_i\), the PIT is calculated as follows:
\[PIT_i=P(Y_i\leq y_i \vert data)\]
where \(Y_i\) is a random variable representing the model's prediction for the \(i\)th observation and \(y_i\) is its observed value.
The PIT measures how well the model’s predicted values align with the distribution of the observed data. Ideally, if the model’s predictions are perfect, the PIT values should follow a uniform distribution between 0 and 1. Deviations from this uniform distribution may indicate issues with model calibration or overfitting. It’s often used to assess the reliability of model predictions and can be visualized through PIT histograms or quantile-quantile (Q-Q) plots.
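For example, assuming the fit_zip object from earlier is available, the PIT values that INLA stores on the fit can be inspected directly (a sketch):

```r
# PIT histogram: bars of roughly equal height suggest good calibration
hist(fit_zip$cpo$pit, breaks = 20,
     main = "PIT, ZIP model", xlab = "PIT value")
```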
The Conditional Predictive Ordinate (CPO) [@pettit1990ConditionalPredictiveOrdinate] is calculated as the posterior probability of the observed data at each observation point, conditional on the rest of the data and the model. For each observation \(y_i\), it is computed as:
\[CPO_i=P(y_i\vert data \setminus y_i, model)\]
where \(\setminus\) means “other than”, so that \(P(y_i | \text{data}\setminus y_i, \text{model})\) represents the conditional probability given all other observed data and the model.
CPO provides a measure of how well the model predicts each individual observation while taking into account the rest of the data and the model. A low CPO value suggests that the model has difficulty explaining that particular data point, whereas a high CPO value indicates a good fit for that observation. In practice, CPO values are often used to identify influential observations, potential outliers, or model misspecification. When comparing models, the following summary of the CPO is often used:
\[-\sum_{i=1}^n\log(CPO_i)\] where smaller values indicate a better model fit.
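With the fits from the previous sections available, this summary can be computed from the cpo element that INLA stores on each fitted model (a sketch):

```r
# -sum(log(CPO_i)) for each model; smaller values indicate better fit
cpo_score <- function(fit) -sum(log(fit$cpo$cpo), na.rm = TRUE)
sapply(
  list(zip = fit_zip, hurdle = fit_zap, hurdle2 = fit_zap2, hurdle3 = fit_zap3),
  cpo_score
)
```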
# PIT for the hurdle model: keep only the entries after the first block of
# nrow(counts_df) values (the count part of the fit), stored at the
# positive-count cells; PIT is undefined at zero counts for this model
zap_pit <- rep(NA_real_, nrow(counts_df))
zap_pit[counts_df$count > 0] <- fit_zap$cpo$pit[-seq_len(nrow(counts_df))]
df <- data.frame(
count = rep(counts_df$count, times = 3),
pred_mean = c(
expect_poi$mean,
expect_zip$mean,
expect_zap$mean
),
pred_var = c(
expect_poi$pred_var,
expect_zip$pred_var,
expect_zap$pred_var
),
log_score = c(
expect_poi$log_score,
expect_zip$log_score,
expect_zap$log_score
),
pit = c(
# mask PIT at zero-count cells so all models are compared on the same cells
fit_poi$cpo$pit * c(NA_real_, 1)[1 + (counts_df$count > 0)],
fit_zip$cpo$pit * c(NA_real_, 1)[1 + (counts_df$count > 0)],
zap_pit
),
Model = rep(c("Poisson", "ZIP", "ZAP"), each = nrow(counts_df))
)
p1 <- ggplot(df) +
geom_point(aes(pred_mean, count - pred_mean, color = Model)) +
ggtitle("Residuals")
p2 <- ggplot(df) +
stat_ecdf(aes(pit, color = Model), na.rm = TRUE) +
scale_x_continuous(expand = c(0, 0)) +
ggtitle("PIT")
patchwork::wrap_plots(p1, p2, nrow = 1, guides = "collect")